Introduction to Data Science with R

Session 1: Welcome!

Ina Bornkessel-Schlesewsky

October 25, 2023

What is Data Science?

Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.

(Wickham, Çetinkaya-Rundel, and Grolemund 2023) Henceforth: R4DS

freely available online

What is Data Science?

StackExchange Data Science user Stephan Kolassa CC BY-SA 4.0 via Wikimedia Commons

What is Data Science?

‘Hal Varian, the chief economist at Google, is known to have said, “The sexy job in the next 10 years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?” If “sexy” means having rare qualities that are much in demand, data scientists are already there. They are difficult and expensive to hire and, given the very competitive market for their services, difficult to retain. There simply aren’t a lot of people with their combination of scientific background and computational and analytical skills.’

I’m not looking for a sexy new job. Why is this relevant for me?


In academic research, across a wide range of disciplines, we’re also interested in turning “raw data into understanding, insight, and knowledge”, as well as in communicating our results!

What you will learn here

  • how to gain insights from data using contemporary computational tools
  • basic programming skills in an open source programming language (i.e. R)
  • how to produce reproducible reports (good for science and good for you!)
  • how to use online repositories such as GitHub or the Open Science Framework to share data and code

You will also develop an understanding of how these tools help to foster open science, reproducible research and thus the ethical treatment of data.

These skills are readily generalisable across a wide range of domains.

So is this just another stats course?


Apart from the fact that we’re using R?


“Oh no you didn’t” gif by happydog from https://giphy.com

Not just another stats course …


Our focus will be on

  • understanding data rather than statistical tests per se (though they may come up in passing)
  • philosophy / workflow rather than “results”
  • (moral of the story: it’s not just about statistical significance!)

Not just another stats course …


You will be introduced to a set of tools and a workflow that

  • foster good practices in dealing with data (i.e. we try to draw the best insights we can from a dataset)
  • foster open science (i.e. we share our data and “show our work”, which is good for science and for sharing knowledge)
  • are economical and reproducible (i.e. we avoid doing stuff by hand and can repeat what we did)

Speaking of workflow


from R4DS
  • import data (into R)
  • tidy data: bring it into a consistent format that can be used for multiple purposes (each column = variable; each row = observation)
    • lets you focus on understanding the data rather than which format you need

Speaking of workflow


from R4DS
  • transform data
    • e.g. focus on observations of interest, create new variables, compute summary statistics
  • visualise data
    • essential for understanding
  • model data
    • use (statistical) models to answer your questions about the data
  • communicate insights
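
As a rough illustration of the transform, visualise and model steps, here is a minimal sketch in base R. It uses three columns from the handful of penguin rows shown later in this session, typed in by hand. The object names (penguins_small, complete, mass_by_sex, fit) are made up for this example, and the course materials may well use different functions for the same steps.

```r
# Import: a few penguin measurements typed in by hand
# (normally these would be read from a file)
penguins_small <- data.frame(
  sex               = c("male", "female", "female", NA, "female", "male"),
  flipper_length_mm = c(181, 186, 195, NA, 193, 190),
  body_mass_g       = c(3750, 3800, 3250, NA, 3450, 3650)
)

# Transform: focus on complete observations and create a new variable
complete <- penguins_small[complete.cases(penguins_small), ]
complete$body_mass_kg <- complete$body_mass_g / 1000

# Transform: compute a summary statistic per group (mean body mass by sex)
mass_by_sex <- aggregate(body_mass_kg ~ sex, data = complete, FUN = mean)
print(mass_by_sex)

# Visualise: scatterplot of flipper length against body mass
plot(complete$flipper_length_mm, complete$body_mass_kg,
     xlab = "Flipper length (mm)", ylab = "Body mass (kg)")

# Model: a simple linear regression
# (five points are far too few for real conclusions; this only shows the step)
fit <- lm(body_mass_kg ~ flipper_length_mm, data = complete)
print(coef(fit))
```

The point is the order of the steps rather than the specific functions: each step takes the output of the previous one, so the whole analysis can be rerun from the top.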

Let’s give it a go!

Penguins!

Artwork by @allison_horst
  • data on penguins from the Palmer Archipelago in Antarctica

Penguins data


species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie   Torgersen  39.1            18.7           181                3750         male    2007
Adelie   Torgersen  39.5            17.4           186                3800         female  2007
Adelie   Torgersen  40.3            18.0           195                3250         female  2007
Adelie   Torgersen  NA              NA             NA                 NA           NA      2007
Adelie   Torgersen  36.7            19.3           193                3450         female  2007
Adelie   Torgersen  39.3            20.6           190                3650         male    2007
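
To give a feel for what "importing" such a table into R can look like, the sketch below reads the six rows above from CSV text using base R. The name penguins_small is made up for this example, and reading from an inline string is only a stand-in for reading a real file.

```r
# The six rows from the table above, written as comma-separated text
csv_text <- "species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Torgersen,39.5,17.4,186,3800,female,2007
Adelie,Torgersen,40.3,18.0,195,3250,female,2007
Adelie,Torgersen,NA,NA,NA,NA,NA,2007
Adelie,Torgersen,36.7,19.3,193,3450,female,2007
Adelie,Torgersen,39.3,20.6,190,3650,male,2007"

# Import: read the text into a data frame; "NA" is treated as a missing value
penguins_small <- read.csv(text = csv_text)

# Inspect: one column per variable, one row per observation
str(penguins_small)

# Count the missing values in each column
print(colSums(is.na(penguins_small)))
```

Note the fourth penguin: all of its measurements are missing (NA), which is something any later summary or plot has to deal with.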

Which penguins are largest?

This must be tricky, right?


  • You will be able to generate (simple) publication-quality figures not unlike these by the end of our next session

Data exploration exercise

Explore the Palmer Penguins data

  • Open this web app: https://ibsneuro.shinyapps.io/palmer_penguins/

  • Tab 1 contains information about the data set and lets you inspect the data frame

  • Tab 2 allows you to generate plots by selecting the type of graph, which variables to put on the x and y axes and which variable to group by (using different colours)

  • In your exploration, consider the questions on the following slide

  • For each question, note down not only your answer but also the strategy you chose to get to it: how did you choose to construct your graph for the question and why?

Explore the Palmer Penguins data

  • If you wanted to predict a penguin’s body mass, which other attributes could you look at (e.g. flipper length, bill length, sex etc.)? In other words, which of the other attributes appear to be most predictive of body mass?

  • Is there a close relationship between bill length and bill depth?

  • Is it possible to look at effects of island (i.e. the environment in which the penguins live) independently of other factors such as species or sex? If not, why not?

So how does this work?

Enter R and RStudio (Posit)

  • R is a programming language for statistical computing (but it can also be used for other things)
  • RStudio (Posit) is an integrated development environment (IDE) for R, which is a fancy way of saying that it provides a convenient platform within which we can use R

Posit Cloud

  • For (the first part of) this workshop, we will be using Posit Cloud, which provides a web-based version of RStudio
  • This means that you won’t have to install anything on your computer and that you will have direct access to all of the materials that I have prepared
  • Go to https://posit.cloud and create a login if you haven’t already
  • Once you have done this, use the link that you were emailed to access the uoc_data_science_2023 workspace on Posit Cloud and select the project data_science_2023
  • When you access the project, you will receive your own copy of it to work on

Exercise: Palmer Penguins

  • go to the Files tab in the lower right pane of your Posit Cloud project
  • go to exercises > 01a_penguins
  • click on the document 01a_penguins.qmd
  • this is a Quarto document, which mixes text and code and is an excellent format for reproducible research reports, as we will see later. Don’t worry too much about the code for now; this is what the document looks like when rendered

Exercise: Palmer Penguins

  • click on the Render button at the top of the document to produce the rendered version
  • have a read-through and look at the figures
  • there is also an interactive table to remind you of what the data look like (they’re the same as in the web app that you interacted with earlier)
  • suggestion: set option “Preview in viewer pane” to avoid problems with pop-up windows

Your turn!

For each of the following challenges, go back to the raw document (i.e. the one that doesn’t look pretty 😄), try to figure out how to make the relevant change and then render the document using Render to see whether you were correct!

  • For the relationship between bill length and depth, change Gentoo to Adelie; check the figure to see if it worked
  • Change the outcome variable in the histogram from body mass to something else and observe what happens. Remember that you can go back to the table at the top of the rendered document to have a look at the available variables. Note that the figure title will only change if you also adapt the text in “title”. (Hint: you may need to change the bin width! What would make sense for the outcome variable you have chosen?)
  • Look at body mass by island rather than by species. What do you see?
  • Look at flipper length by sex rather than species
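
The hint about bin width in the histogram challenge can be illustrated with a small base R sketch. This is not the code from the exercise document: base R's hist() takes break points rather than a bin width, so breaks_for() is a hypothetical helper written for this example, and the values are the few measurements shown on the "Penguins data" slide.

```r
# Hypothetical helper: turn a chosen bin width into histogram break points
breaks_for <- function(x, binwidth) {
  x <- x[!is.na(x)]
  lower <- floor(min(x) / binwidth) * binwidth
  upper <- ceiling(max(x) / binwidth) * binwidth
  if (upper == lower) upper <- lower + binwidth
  seq(lower, upper, by = binwidth)
}

body_mass_g       <- c(3750, 3800, 3250, NA, 3450, 3650)
flipper_length_mm <- c(181, 186, 195, NA, 193, 190)

# A bin width of 200 suits body mass (values span several hundred grams) ...
hist(body_mass_g, breaks = breaks_for(body_mass_g, 200),
     main = "Body mass", xlab = "Body mass (g)")

# ... but the same width would lump all flipper lengths into very few bars
n_bins_wide <- length(breaks_for(flipper_length_mm, 200)) - 1
print(n_bins_wide)

# A bin width of 5 mm is on the right scale for flipper length
hist(flipper_length_mm, breaks = breaks_for(flipper_length_mm, 5),
     main = "Flipper length", xlab = "Flipper length (mm)")
```

The general idea: a sensible bin width depends on the units and spread of the variable, so it usually has to change when the outcome variable changes.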

Finally: a few more details on the course

Assessment

Report

  • Due at the end of the course / semester (I need to check on the formalities)
  • Create a reproducible report on a data science project of your own choosing
  • More details to come

Weekly exercises

  • There will be weekly exercises to allow you to practise / reflect on / extend the content for each week
  • These will not be assessed (and need not be submitted), but you are strongly encouraged to complete them from week to week, as they will help you to consolidate the content learned

Course schedule 1

October to December

Week  Date        Topic
1     25/10/2023  Welcome and introduction
N/A   01/11/2023  No workshop (public holiday)
2     08/11/2023  Basic data exploration and data viz
3     15/11/2023  Data exploration part 2
4     22/11/2023  Importing data (+ installing R)
5     29/11/2023  Tidying data
6     06/12/2023  Working with text (“strings”)

Course schedule 2

January

  • Zoom sessions (dates and times to be negotiated)
  • Suggested topics:
    • version control with git and GitHub
    • other (more advanced) topics by negotiation
    • consultation sessions for your report

Week 1 exercises

  1. If you wanted to predict a penguin’s body mass, which other attributes could you look at (e.g. flipper length, bill length, sex etc.)? In other words, which of the other attributes appear to be most predictive of body mass? In addition to your answer, briefly note down how you went about answering this question.

  2. Is there a close relationship between bill length and bill depth? Again, please describe briefly how you went about addressing this question. Also consider whether the answer might be a bit more complex than just “yes” or “no”.

  3. Is it possible to look at effects of island (i.e. the environment in which the penguins live) independently of other factors such as species or sex? If not, why not?

  4. Explore another 2 or 3 questions that interest you. Briefly describe these questions and what you found below.

  5. Reflect on what the exercise (using the web app) has shown you regarding the use of different graph types to address different questions.

  6. What were your first impressions when interacting with the Quarto document? For example: Were you able to make the changes requested in the exercise? What did you find easy? What did you find challenging?

If you have questions

Please email me: Ina.Bornkessel-Schlesewsky@unisa.edu.au

Hopefully, I will also have access to my UoC login details soon, allowing me to send out any announcements in regard to the course.

In the meantime, you can check the course website for any updates:

References

Gorman, Kristen B., Tony D. Williams, and William R. Fraser. 2014. “Ecological Sexual Dimorphism and Environmental Variability Within a Community of Antarctic Penguins (Genus Pygoscelis).” Edited by André Chiaradia. PLoS ONE 9 (3): e90081. https://doi.org/10.1371/journal.pone.0090081.
Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. “Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data.” https://allisonhorst.github.io/palmerpenguins/.
Wickham, Hadley, Mina Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly Media, Inc. https://www.oreilly.com/library/view/r-for-data/9781492097396/.